This model is a fine-tuned version of microsoft/Florence-2-large for image-to-text tasks. It was trained on a 40,000-image subset of the Ejafa/ye-pop dataset, using annotations generated by THUDM/cogvlm2-llama3-chat-19B.
Image-to-Text
Transformers
Supports Multiple Languages
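As a rough illustration of how a Florence-2 fine-tune like this one might be used for captioning with Transformers, the sketch below follows the loading pattern published for the base microsoft/Florence-2-large model. Several things here are assumptions and not details from this card: `MODEL_ID` points at the base model because the fine-tuned checkpoint's repository id is not given here, the `<MORE_DETAILED_CAPTION>` task prompt is the base model's and may differ from what this fine-tune expects, and `caption_image` is a hypothetical helper name.

```python
# Minimal captioning sketch, assuming the fine-tune keeps the base
# Florence-2 interface (custom processor and generate-based decoding).

# Assumption: base model id; replace with the fine-tuned checkpoint's repo id.
MODEL_ID = "microsoft/Florence-2-large"
# Assumption: base Florence-2 task token; the fine-tune may expect another prompt.
TASK_PROMPT = "<MORE_DETAILED_CAPTION>"


def caption_image(image_path, model_id=MODEL_ID, task_prompt=TASK_PROMPT,
                  max_new_tokens=256):
    """Return a caption string for the image at image_path (hypothetical helper)."""
    # Heavy dependencies are imported lazily so the module itself stays light.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Florence-2 ships custom modeling code, hence trust_remote_code=True.
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True
    ).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=task_prompt, images=image, return_tensors="pt")

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=max_new_tokens,
            num_beams=3,
            do_sample=False,
        )

    # Keep special tokens: the Florence-2 post-processor parses them itself.
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        raw_text, task=task_prompt, image_size=(image.width, image.height)
    )
    return parsed[task_prompt]
```

Calling `caption_image("photo.jpg")` (with a placeholder path) would download the weights on first use and return a single caption string. Loading the model inside the function keeps the sketch short; for repeated calls it would make more sense to load the processor and model once and reuse them.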